{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对数据进行删除、补充或者修改，洗掉“杂质”，主要可以考虑这些方法：    \n",
    "\n",
    "（1）删除数据；    \n",
    "\n",
    "（2）空值填充；   \n",
    "\n",
    "（3）数据盖帽；   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.数据删除   \n",
    "\n",
    "对于要删除的数据可以从**饱和度、稳定性、相关性、低方差**来考虑\n",
    "\n",
    "##### 饱和度\n",
    "\n",
    "饱和度过低（空值太多）的数据显然对于建模帮助不大，这可以从两个方向来考虑：   \n",
    "\n",
    "（1）按列：对饱和度过低的特征去掉；   \n",
    "\n",
    "（2）按行：如果单条数据的饱和度过低，也可以考虑去掉     \n",
    "\n",
    "##### 稳定性\n",
    "\n",
    "而特征的稳定性对于某些建模业务影响也很大，这可以通过T+1的数据分布来看，比如相邻两个月的数据分布差异很大，可以考虑去掉这列特征，它可能会对建模起到副作用，对连续值的分布可以考虑均值、IQR、分位数、方差等等，对离散值可以考虑频率、频数等     \n",
    "\n",
    "\n",
    "##### 相关性\n",
    "\n",
    "相关性是指与y的相关性较强的特征，这里主要需要注意的是数据泄露的情况，可以分为如下的两种情况，x产生于y之前，y可能直接或者间接由x加工得到，由于预测阶段x也会在预测之前生成，所以这时的x起着正向作用，应该保留；而另外一种情况就恰好相反了，如果x是由y加工而来的，那么在预测阶段x其实是空值，这时起着负向作用，应该删除：     \n",
    "\n",
    "（1）x=>y(保留);   \n",
    "（2）y=>x(删除);  \n",
    "\n",
    "#### 低方差\n",
    "\n",
    "低方差意味着特征取值基本都一样，显然无用\n",
    "\n",
    "PS：不要忘记了那些虽然有值，但与空值等价的数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.空值填充\n",
    "\n",
    "对于剩下的NaN值就要考虑填充的问题了，大概分为如下几种：   \n",
    "\n",
    "（1）填充某一定值；   \n",
    "\n",
    "（2）填充某统计量：均值、中位数、众数；   \n",
    "\n",
    "（3）依赖性填充：操作的不多"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "train_df=pd.read_csv(\"./titanic/train.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId      0\n",
       "Survived         0\n",
       "Pclass           0\n",
       "Name             0\n",
       "Sex              0\n",
       "Age            177\n",
       "SibSp            0\n",
       "Parch            0\n",
       "Ticket           0\n",
       "Fare             0\n",
       "Cabin          687\n",
       "Embarked         2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#定值填充\n",
    "train_df['Cabin'].fillna('missing',inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "#众数填充\n",
    "train_df['Embarked'].fillna(train_df['Embarked'].mode()[0],inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>38.233441</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>29.877630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>25.140620</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Pclass        Age\n",
       "0       1  38.233441\n",
       "1       2  29.877630\n",
       "2       3  25.140620"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Age与Pclass具有比较强的相关性，利用Age所属Pclass组的均值对其缺失值进行填充\n",
    "train_df.groupby('Pclass')['Age'].mean().reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fill_age(pclass,age):\n",
    "    if np.isnan(age):#注意：如果当前列为float/int类型，当前列中的None会被强制转为float的nan类型\n",
    "        if pclass==1:\n",
    "            return 39.159930\n",
    "        elif pclass==2:\n",
    "            return 29.506705\n",
    "        else:\n",
    "            return 24.816367\n",
    "    else:\n",
    "        return age\n",
    "train_df['Age']=train_df.apply(lambda row:fill_age(row['Pclass'],row['Age']),axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PassengerId    0\n",
       "Survived       0\n",
       "Pclass         0\n",
       "Name           0\n",
       "Sex            0\n",
       "Age            0\n",
       "SibSp          0\n",
       "Parch          0\n",
       "Ticket         0\n",
       "Fare           0\n",
       "Cabin          0\n",
       "Embarked       0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "也可以多种填充策略组合，比如先依赖于相关性高的进行填充，然后利用均值/中位数等填充，然后fillna一个确定的值等，另外sklearn.preprocessing.Imputer也可方便进行均值、中位数、众数填充，但是，这里有个需要画线的内容，那还是**数据泄露**问题：    \n",
    "\n",
    "**利用均值、中位数、众数、依赖性填充时，避免用到验证集/测试集的数据，且验证集、测试集的填充也要用训练集的数据**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.盖帽\n",
    "盖帽操作可以在一定程度上去掉一些异常点，对提高模型的泛化能力有好处，但是盖多了又会影响原始数据的分布，盖多盖少，酌情使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "#将数值低于5%分位数的设置为5%分位数值，高于95%分位数的设置为95%分位数值\n",
    "low_thresh=5\n",
    "high_thresh=95\n",
    "def cap_floor(low_thresh,high_thresh,train_df):\n",
    "    cap_dict={}\n",
    "    for column in train_df.columns:\n",
    "        if train_df[column].dtype==object:\n",
    "            continue\n",
    "        low_value=np.percentile(train_df[column],low_thresh)\n",
    "        high_value=np.percentile(train_df[column],high_thresh)\n",
    "        if low_value==high_value:#这里相当于不进行盖帽\n",
    "            low_value=np.min(train_df[column])\n",
    "            high_value=np.max(train_df[column])\n",
    "        cap_dict[column]=[low_value,high_value]\n",
    "    return cap_dict\n",
    "#更新\n",
    "def cap_update(column,x,cap_dict):\n",
    "    if column not in cap_dict:\n",
    "        return x\n",
    "    if x>cap_dict[column][1]:\n",
    "        return cap_dict[column][1]\n",
    "    elif x<cap_dict[column][0]:\n",
    "        return cap_dict[column][0]\n",
    "    else:\n",
    "        return x\n",
    "cap_dict=cap_floor(low_thresh,high_thresh,train_df)\n",
    "for column in train_df.columns:\n",
    "    train_df[column]=train_df[column].apply(lambda x:cap_update(column,x,cap_dict))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'PassengerId': [45.5, 846.5],\n",
       " 'Survived': [0.0, 1.0],\n",
       " 'Pclass': [1.0, 3.0],\n",
       " 'Age': [6.0, 54.0],\n",
       " 'SibSp': [0.0, 3.0],\n",
       " 'Parch': [0.0, 2.0],\n",
       " 'Fare': [7.225, 112.07915]}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cap_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了方便后面代码复用，将这一节的代码封装到utils.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
